Abstract

This work studies the theoretical rules of feature selection in linear
discriminant analysis (LDA), and a new feature selection method is proposed
for sparse linear discriminant analysis. An $\ell_1$ minimization method is
used to select the important features from which the LDA will be constructed.
The asymptotic results of this proposed two-stage LDA (TLDA) are studied,
demonstrating that TLDA is an optimal classification rule whose convergence
rate is the best compared to existing methods. The experiments on
simulated and real datasets are consistent with the theoretical results and
show that TLDA performs favorably in comparison with current methods.
Overall, TLDA uses fewer features or genes than other approaches and
achieves a lower misclassification rate.
1. Introduction

Classification in high-dimensional data is a common problem which has
created new challenges for traditional statistical methods. For instance,
the classification of leukemia data (Golub et al., 1999) is a classic
high-dimensional example in which there are 7129 genes and 72 samples coming
from two classes. Due to the small sample size $n$ and large dimension $p$,
a setting often referred to as "large $p$, small $n$", estimators of
the sample mean and covariance matrix are usually unstable. In a seminal
paper, Bickel and Levina (2004) proved that linear discriminant analysis
(LDA) is no better than a random guess when $p/n \to \infty$. In the
literature, researchers have proposed two classes of independent rules to deal
with high-dimensional classification.

A natural method is to ignore the dependence among the variables, which
leads to the so-called naive Bayes classifier; see Dudoit et al. (2002a) or
Bickel and Levina (2004) for more details. This independence rule has also
been well studied in many works such as Dudoit et al. (2002b), Tibshirani et al.
(2002), and Barry et al. (2005). However, the correlation ignored by the
naive Bayes classifier may be very important for classification. This is
partially evidenced by Fan et al. (2012), who comment that the theoretical
misclassification rate of the naive Bayes classifier is higher than that of
Fisher's rule unless the true population covariance matrix is diagonal.

An alternative approach involves individual analysis. Fan and Fan (2008)
proposed using the two-sample t-statistic to select features. For every feature,
a t-score is calculated and the features are chosen by their t-scores. Similar
rules can also be found in Zuber and Strimmer (2009), Tibshirani and Wasserman
(2006), and Lai (2008). Fan and Fan (2008) proved that the two-sample
t-statistic can pick up all the differentially expressed features.
However, those differentially expressed features may not be the best features
for classification unless the true population covariance matrix is diagonal.
For example, Wu et al. (2009) pointed out that in gene analysis, most genes
are not expressed sufficiently differently to be detected by the
t-statistic.
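The per-feature screening described above can be sketched as follows. This is a minimal illustration, not the cited authors' code: the unequal-variance form of the t-statistic, the function names, and the toy data are assumptions made here for demonstration.

```python
import math
from statistics import mean, variance

def two_sample_t(x1, x2):
    """Two-sample t-statistic for one feature (unequal-variance form, assumed)."""
    n1, n2 = len(x1), len(x2)
    se = math.sqrt(variance(x1) / n1 + variance(x2) / n2)
    return (mean(x1) - mean(x2)) / se

def rank_features_by_t(class1, class2):
    """class1, class2: lists of samples, each a list of p feature values.
    Returns (feature indices sorted by decreasing |t|, list of |t| scores)."""
    p = len(class1[0])
    scores = []
    for j in range(p):
        col1 = [row[j] for row in class1]
        col2 = [row[j] for row in class2]
        scores.append(abs(two_sample_t(col1, col2)))
    order = sorted(range(p), key=lambda j: -scores[j])
    return order, scores

# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
class1 = [[2.0, 0.1], [2.2, -0.1], [1.8, 0.0], [2.1, 0.2]]
class2 = [[0.0, 0.0], [0.2, 0.1], [-0.1, -0.2], [0.1, 0.1]]
order, scores = rank_features_by_t(class1, class2)
```

Because each feature is scored on its own, this ranking cannot see correlations between features, which is the limitation discussed in the surrounding text.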



Fan et al. (2012) and Mai et al. (2012) found that the above rules could
result in misleading feature selection and inferior classification, because
they select features by the t-statistic or ignore the correlations among
features. As also pointed out in Wu et al. (2009), there is often a group of
correlated genes in gene expression analysis in which correlations cannot be
ignored, and the covariance information can help to reduce the
misclassification rate. Assuming that the population covariance matrix and mean
are sparse, Shao et al. (2011) used a thresholding procedure to estimate
the parameters and plugged these estimators into the LDA. A constrained
$\ell_1$ minimization method is introduced in Cai and Liu (2011) to estimate
the classification direction, and other methods include those of Wu et al.
(2009), Tong et al. (2012), Mai et al. (2012), Fan et al. (2012), Li et al.
(2001), and Goeman et al. (2011).



As Fan and Fan (2008) commented, the difficulty of high-dimensional
classification is intrinsically caused by the existence of many noise
features that do not contribute to the reduction of the misclassification rate.
Thus, if we can select a subset of important features, high-dimensional
classification becomes manageable. In gene expression, especially in
diagnostic tests, selecting signature genes for accurate classification is
essential (Yeung et al., 2012). In this article, we study a theoretical rule
to capture the discriminant features for classification. Generally, the best
$s$ features for classification are those having the same (or almost the same)
theoretical misclassification rate as all $p$ features. When the true linear
discriminant direction is sparse, we can select a subset of features having
the same misclassification rate as all $p$ features. For the asymptotically
sparse situation, the misclassification rate based on our selected features is
also close to the theoretical misclassification rate. Our results show that
the main condition used in Fan et al. (2012), Cai and Liu (2011), Mai et al.
(2012), and Shao et al. (2011) ensures the existence of such a small subset of
important features, which can be selected to derive a more stable and accurate
classification result.

In this work, a two-stage LDA (TLDA) is proposed to learn high-dimensional
data. TLDA uses $\ell_1$ minimization, which is a linear program, to select
important features; LDA is then constructed on these selected
features. Asymptotic results of the proposed TLDA are studied, and
consistency and convergence results are given. Experiments show that, under
the same regularity conditions as in Fan et al. (2012), Cai and Liu (2011),
and Mai et al. (2012), TLDA achieves a better convergence rate. Simulation
studies and experiments on real datasets support our theoretical results and
demonstrate that TLDA outperforms existing methods.

The rest of the paper is organized as follows. In Section 2, we investigate 
the theoretical rule of choosing features and the asymptotic results. Evalu- 
ations in simulated data are included in Section 3. In Section 4, TLDA is 
applied to three real datasets to demonstrate its performance on real data. 
Finally, we conclude the article in Section 5. All the proofs are given in
the Appendix.
2. Methods

Let $X$ be a $p$-dimensional normal random vector belonging to class $k$
if $X \sim N_p(\mu_k, \Sigma)$, $k = 1, 2$, where $\mu_1 \neq \mu_2$ and
$\Sigma$ is a positive definite symmetric matrix. If $\mu_1$, $\mu_2$, and
$\Sigma$ are known, the optimal classification rule is Fisher's linear
discriminant rule

    $$\delta_F(X) = I\{(X - \mu_a)^T \Sigma^{-1} \mu_d > 0\}, \tag{2.1}$$

where $\mu_a = (\mu_1 + \mu_2)/2$, $\mu_d = (\mu_1 - \mu_2)/2$, and $I$ denotes
the indicator function, with value 1 corresponding to classifying $X$ to class
1 and value 0 to class 2. Fisher's rule is equivalent to the Bayes rule with
equal prior probabilities for two classes. The misclassification rate of the
optimal rule is

    $$R = 1 - \Phi(\Delta_p^{1/2}), \qquad \Delta_p = \mu_d^T \Sigma^{-1} \mu_d, \tag{2.2}$$

where $\Phi$ is the standard normal distribution function.
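As a small numerical illustration of (2.1) and (2.2), the sketch below evaluates Fisher's rule and the optimal misclassification rate for a two-dimensional example with known parameters. The helper names and the example values are assumptions made here for illustration; the 2x2 solver is only meant for this toy case.

```python
import math

def solve_2x2(S, b):
    """Solve S w = b for a 2x2 matrix S given as nested lists."""
    (a, c), (d, e) = S
    det = a * e - c * d
    return [(e * b[0] - c * b[1]) / det, (a * b[1] - d * b[0]) / det]

def norm_cdf(z):
    """Standard normal distribution function Phi."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def fisher_rule(x, mu1, mu2, Sigma):
    """Rule (2.1): class 1 if (x - mu_a)^T Sigma^{-1} mu_d > 0, else class 2."""
    mu_a = [(u + v) / 2 for u, v in zip(mu1, mu2)]
    mu_d = [(u - v) / 2 for u, v in zip(mu1, mu2)]
    w = solve_2x2(Sigma, mu_d)  # Sigma^{-1} mu_d
    score = sum((xi - ai) * wi for xi, ai, wi in zip(x, mu_a, w))
    return 1 if score > 0 else 2

def optimal_error(mu1, mu2, Sigma):
    """Rate (2.2): R = 1 - Phi(sqrt(Delta_p)), Delta_p = mu_d^T Sigma^{-1} mu_d."""
    mu_d = [(u - v) / 2 for u, v in zip(mu1, mu2)]
    w = solve_2x2(Sigma, mu_d)
    delta_p = sum(d * wi for d, wi in zip(mu_d, w))
    return 1.0 - norm_cdf(math.sqrt(delta_p))

# Hypothetical example: identity covariance, means two units apart.
mu1, mu2 = [1.0, 0.0], [-1.0, 0.0]
Sigma = [[1.0, 0.0], [0.0, 1.0]]
R = optimal_error(mu1, mu2, Sigma)           # Delta_p = 1, so R = 1 - Phi(1)
label = fisher_rule([0.5, 3.0], mu1, mu2, Sigma)
```

In this example $\Delta_p = 1$, so the optimal rate is $1 - \Phi(1)$, roughly 0.159.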

In practice, Fisher's rule is typically not directly applicable because the
parameters are usually unknown and need to be estimated from the samples.
Let $\{X_{1j}, j = 1, \ldots, n_1\}$ and $\{X_{2j}, j = 1, \ldots, n_2\}$ be
independent and identically distributed random samples from
$N_p(\mu_1, \Sigma)$ and $N_p(\mu_2, \Sigma)$, respectively. The maximum
likelihood estimators of $\mu_1$, $\mu_2$, $\Sigma$ are

    $$\bar X_k = \frac{1}{n_k} \sum_{j=1}^{n_k} X_{kj}, \quad k = 1, 2, \qquad S_n = \frac{1}{n} \sum_{k=1}^{2} \sum_{j=1}^{n_k} (X_{kj} - \bar X_k)(X_{kj} - \bar X_k)^T,$$

where $n = n_1 + n_2$. Setting

    $$\hat\mu_a = \frac{\bar X_1 + \bar X_2}{2}, \qquad \hat\mu_d = \frac{\bar X_1 - \bar X_2}{2},$$

and estimating $\Sigma^{-1}$ by $S_n^{-1}$ (or by a generalized inverse
$S_n^{-}$ when $S_n^{-1}$ does not exist), Fisher's rule becomes the classic LDA

    $$\delta_{LDA}(X) = I\{(X - \hat\mu_a)^T S_n^{-1} \hat\mu_d > 0\},$$

and the misclassification rate of LDA based on the samples
$\{X_{1j}, j = 1, \ldots, n_1\}$ and $\{X_{2j}, j = 1, \ldots, n_2\}$ is

    $$R_{LDA} = \frac{1}{2} \Phi\left( \frac{(\hat\mu_a - \mu_1)^T S_n^{-1} \hat\mu_d}{(\hat\mu_d^T S_n^{-1} \Sigma S_n^{-1} \hat\mu_d)^{1/2}} \right) + \frac{1}{2} \Phi\left( -\frac{(\hat\mu_a - \mu_2)^T S_n^{-1} \hat\mu_d}{(\hat\mu_d^T S_n^{-1} \Sigma S_n^{-1} \hat\mu_d)^{1/2}} \right),$$

which has been well studied when $p$ is fixed; more details can be obtained
from Anderson (2003).

For classification, the best $s$ features are those with the largest
$\Delta_s$, where $\Delta_s$ is the counterpart of $\Delta_p$. We begin with
basic notation and definitions.

For a vector $a = (a_1, \ldots, a_p)^T$, we define
$|a|_0 = \sum_{j=1}^{p} I(a_j \neq 0)$, $|a|_1 = \sum_{j=1}^{p} |a_j|$, and
$|a|_2 = (\sum_{j=1}^{p} a_j^2)^{1/2}$. For any index set
$A \subset \{1, \ldots, p\}$, $A^c = \{j \in \{1, \ldots, p\} : j \notin A\}$,
and $C$ denotes a constant which varies from place to place. For any two index
sets $A$ and $A'$ and matrix $B$, we use $B_{AA'}$ to denote the matrix with
rows and columns of $B$ indexed by $A$ and $A'$. For a vector $b$, $b_A$
denotes a new vector with elements of $b$ indexed by $A$. In particular,

    $$\Delta_A = (\mu_d)_A^T (\Sigma_{AA})^{-1} (\mu_d)_A,$$

which dominates the theoretical misclassification rate if we only use features
corresponding to index set $A$.

The following propositions give solutions to the feature selection problem.
Here and below we write $\beta_0 = 2\Sigma^{-1}\mu_d$.

Proposition 2.1. Let $A = \{k : (\beta_0)_k \neq 0\}$. We have

    $$\Delta_A = \Delta_p. \tag{2.3}$$

Proposition 2.1 means that the best features are indexed by the support of
$\beta_0$. If $\beta_0$ is approximately sparse, which means that many entries
of $\beta_0$ are very small, we have the following result.

Proposition 2.2. Assuming that there is a constant $c_0$ (not dependent on $p$)
such that $c_0^{-1} \le$ all eigenvalues of $\Sigma_p \le c_0$ and there exists
$A_1 \subset \{1, 2, \ldots, p\}$ satisfying
$s_p = \sum_{k \in A_1^c} |(\beta_0)_k|^2 \to 0$, we have

    $$\Delta_p - \Delta_{A_1} = O(s_p). \tag{2.4}$$



Propositions 2.1 and 2.2 provide the theoretical foundations for choosing
features, and next we study how to recover the support of $\beta_0$ from the
samples. In other fields, such as compressed sensing and high-dimensional
linear regression, constrained $\ell_1$ minimization has been a common method
for reconstructing a sparse signal (Donoho et al., 2006; Candes and Tao, 2007).
In a recent work, Cai and Liu (2011) applied $\ell_1$ minimization
to estimate $\beta_0$ directly. However, as Candes and Tao (2007) pointed out,
a two-stage $\ell_1$ minimization procedure tends to perform better in
practice; more details can be found in the discussions in Candes and Tao (2007).
Motivated by this, we use $\ell_1$ minimization in our work to select features
and then construct LDA on those selected features.

First, to ensure the identifiability of the important features, we assume
that there exists $A \subset \{1, 2, \ldots, p\}$ satisfying
$p_0 = |A|_0 = o(\sqrt{n/\log p})$, $(\beta_0)_{A^c} = 0$, and
$\min_{k \in A} |(\beta_0)_k| \ge c_p$. Based on the samples, we first consider
the $\ell_1$ minimization method,

    $$\hat\beta \in \arg\min\{|\beta|_1 \ \text{subject to} \ |S_n \beta - (\bar X_1 - \bar X_2)|_\infty \le \lambda_n\}, \tag{2.5}$$

where $\lambda_n$ is a tuning parameter. Second, important features will be
selected as

    $$A^* = \{j : |\hat\beta_j| \ \text{is among the first largest} \ p_0 \ \text{of all}\}. \tag{2.6}$$
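The selection step (2.6) and the feasibility constraint in (2.5) can be sketched as follows. This is only an illustration under stated assumptions: the estimate `beta_hat` is taken as given (solving the linear program itself is not shown), and the function names and numbers are hypothetical.

```python
def select_top_features(beta_hat, p0):
    """Step (2.6): indices of the p0 largest |beta_hat_j|, in increasing order."""
    order = sorted(range(len(beta_hat)),
                   key=lambda j: abs(beta_hat[j]), reverse=True)
    return sorted(order[:p0])

def max_abs_residual(S, beta, d):
    """Sup-norm |S beta - d|_inf appearing in the constraint of (2.5)."""
    return max(
        abs(sum(S[i][j] * beta[j] for j in range(len(beta))) - d[i])
        for i in range(len(d))
    )

# Hypothetical estimate with three clearly non-negligible entries.
beta_hat = [0.48, 0.0, -0.71, 0.02, 0.0, 1.05, -0.01]
selected = select_top_features(beta_hat, 3)

# A candidate beta is feasible for (2.5) when the residual is at most lambda_n.
S = [[1.0, 0.5], [0.5, 1.0]]
residual = max_abs_residual(S, [1.0, 0.0], [1.0, 0.4])
```

Ties in $|\hat\beta_j|$ are broken here by index order, which is an arbitrary choice of this sketch.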

Before introducing the asymptotic properties of TLDA, we specify the
following regularity conditions

    $$c_0^{-1} \le n_1/n_2 \le c_0, \quad c_0^{-1} \le \lambda_{\min}(\Sigma_p) \le \lambda_{\max}(\Sigma_p) \le c_0,$$
    $$\log p \le n, \quad \Delta_p \ge c_0^{-1} \ \text{for some constant} \ c_0 > 1, \tag{2.7}$$

which are commonly used in high-dimensional settings. Our first result is
the consistency of $A^*$, that is, $A^* = A$ with high probability.

Theorem 2.1. Let $\lambda_n = C\sqrt{\Delta_p \log p / n}$, with $C > 0$ being
a sufficiently large constant. Suppose that (2.7) holds and that
$c_p^2 / (\Delta_p p_0 \sqrt{\log p / n}) \to \infty$. Then

    $$P(A^* = A) = 1 - O(p^{-1}). \tag{2.8}$$

From (2.8), we know that the truly important feature set $A$ will be recovered
by $A^*$ with a high probability. If the LDA is constructed on those selected
features, the following results demonstrate the explicit convergence rate of
the misclassification rate based on features $A^*$.

Theorem 2.2. Under the assumptions of Theorem 2.1, apply LDA to the
features $A^*$ and denote the corresponding misclassification rate by
$R_{A^*}$. Then the following hold.

(1) $R_{A^*} - R \to 0$ in probability.

(2) If further assuming $\Delta_p p_0 \sqrt{\log p_0 / n} \to 0$,

    $$\frac{R_{A^*}}{R} - 1 = O(p_0 \Delta_p \sqrt{\log p_0 / n}), \tag{2.9}$$

with probability greater than $1 - O(p^{-1})$.



Remark 2.1. According to Definition 1 of Shao et al. (2011), with probability
greater than $1 - O(p^{-1})$, TLDA is asymptotically optimal when
$\Delta_p p_0 \sqrt{\log p_0 / n} \to 0$. Furthermore, the conditions in
Theorems 2.1 and 2.2 are similar to those in Fan et al. (2012), Mai et al.
(2012), and Cai and Liu (2011), but our method has a better convergence rate.
For example, Theorem 3 in Cai and Liu (2011) shows that
$R_n / R - 1 = O(p_0 \Delta_p \sqrt{\log p / n})$. Since $p_0 \ll p$, our
rate, which involves $\log p_0$ rather than $\log p$, is sharper in this case.
This means that, compared with estimating $\beta_0$ directly, our two-stage
method improves the results in theory.



3. Simulations 

In practice, the final LDA depends on the parameter $\lambda_n$, which can be
selected by cross-validation (CV) as in Cai and Liu (2011), and on $p_0$,
which can also be selected by CV. Our algorithm is outlined below.

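Algorithm 1 A two-stage LDA based on $\ell_1$ minimization
1. Calculate the sample covariance matrix $S_n$ and means $\bar X_k$, $k = 1, 2$;
2. $\hat\beta_{\lambda_n} = \arg\min_{\beta \in R^p} \sum_{k=1}^{p} |\beta_k|$ subject to $|S_n \beta - (\bar X_1 - \bar X_2)|_\infty \le \lambda_n$;
3. Denote the tuning parameters chosen by five-fold CV as $\hat\lambda_n$ and $\hat p_0$.
   Here we adjust $\hat\lambda_n$ as $\lambda = \sqrt{4/5}\,\hat\lambda_n$;
4. $A^* = \{j : |\hat\beta_j| \ \text{is among the first largest} \ \hat p_0 \ \text{of all}\}$;
5. $\hat\beta^* = \{(S_n)_{A^*A^*}\}^{-1}((\bar X_1)_{A^*} - (\bar X_2)_{A^*})$;
6. If $(Y - (\bar X_1 + \bar X_2)/2)_{A^*}^T \hat\beta^* > 0$, classify $Y$ to class 1, else class 2.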
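The second stage of Algorithm 1 (steps 5 and 6) can be sketched as below, assuming the index set $A^*$ has already been selected. This is a plain-Python illustration, not the authors' implementation: the Gaussian-elimination helper, the function names, and the toy numbers are all assumptions made here.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def refit_direction(S, xbar1, xbar2, selected):
    """Step 5: beta* = (S_{A*A*})^{-1} ((xbar1)_{A*} - (xbar2)_{A*})."""
    S_sub = [[S[i][j] for j in selected] for i in selected]
    diff = [xbar1[j] - xbar2[j] for j in selected]
    return solve(S_sub, diff)

def classify(y, xbar1, xbar2, selected, beta_star):
    """Step 6: class 1 if (y - (xbar1 + xbar2)/2)_{A*}^T beta* > 0, else 2."""
    score = sum((y[j] - (xbar1[j] + xbar2[j]) / 2) * b
                for j, b in zip(selected, beta_star))
    return 1 if score > 0 else 2

# Hypothetical inputs: three features, the first two selected.
S = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]
xbar1, xbar2 = [1.0, 0.0, 0.1], [0.0, 0.0, 0.0]
selected = [0, 1]
beta_star = refit_direction(S, xbar1, xbar2, selected)
label = classify([1.0, 0.0, 5.0], xbar1, xbar2, selected, beta_star)
```

Only the selected coordinates enter the refit and the decision, so the large value in the unselected third coordinate of the test point has no effect on the label.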



The reason for adjusting $\hat\lambda_n$ as $\lambda = \sqrt{4/5}\,\hat\lambda_n$
is that $\lambda_n = C\sqrt{\Delta_p \log p / n}$ and the sample size is $4n/5$
rather than $n$ in five-fold CV. The simulations reported in Table 4 of Cai and
Liu (2011) also support our adjustment here. Furthermore, the $\ell_1$
minimization is a linear program, which is very attractive for high-dimensional
data and can be implemented by many existing programs, such as the function
linprogPD included in the R package "clime", which is available at
http://cran.r-project.org/web/packages/clime/index.html.
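The rescaling argument can be made concrete with a short sketch. It assumes only what the text states, namely that $\lambda_n$ scales as $1/\sqrt{n}$ and that each CV training fold holds a fraction $(K-1)/K$ of the data; the function name and the generalization to $K$ folds are assumptions of this illustration.

```python
import math

def adjust_lambda(lambda_cv, folds=5):
    """Rescale a CV-selected lambda from training size (folds-1)/folds * n to n.

    Since lambda_n is proportional to 1/sqrt(n), moving from 4n/5 to n
    multiplies it by sqrt(4/5) when folds = 5.
    """
    return math.sqrt((folds - 1) / folds) * lambda_cv

lam = adjust_lambda(1.0)  # sqrt(4/5), about 0.894
```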

We now present the results of simulation studies which were designed to
evaluate the performance of the proposed TLDA. For the purpose of comparison,
we also apply several other methods to the data, specifically, linear
programming discriminant (LPD) (Cai and Liu, 2011), regularized optimal
affine discriminant (ROAD) (Fan et al., 2012; Wu et al., 2009), and Fisher's
oracle rule (Oracle). The oracle rule is included as a benchmark.
The LPD is solved by the R package clime, and the MATLAB code for
ROAD is available at http://www.mathworks.com/matlabcentral/fileexchange/40047.

In simulations, we fix the sample size $n_1 = n_2 = 100$ and without
loss of generality we set $\mu_2 = 0$. For the true classification direction
$\beta_0$, $(\beta_0)_{[(2k-1)p/10]} = (-1)^{k+1}(k+1)/4$, $k = 1, \ldots, 5$,
and all other elements are zero. Two kinds of population covariance matrix
will be considered.

Model 1. $\Sigma = (\sigma_{ij})_{p \times p}$, where $\sigma_{ij} = 0.8^{|i-j|}$ for $1 \le i, j \le p$.

Model 2. $\Sigma = (\sigma_{ij})_{p \times p}$, where $\sigma_{ii} = 1$ for $1 \le i \le p$ and $\sigma_{ij} = 0.5$ for
$i \neq j$.
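The simulation design above can be sketched in a few lines. This is an illustration under assumptions: the nonzero positions are taken to be $(2k-1)p/10$ (1-based), $\mu_1$ is obtained as $\Sigma\beta_0$ because $\mu_2 = 0$, and all function names are hypothetical.

```python
def beta0(p):
    """Sparse direction: entry at 1-based position (2k-1)p/10 is (-1)^(k+1)(k+1)/4."""
    beta = [0.0] * p
    for k in range(1, 6):
        pos = (2 * k - 1) * p // 10          # 1-based position, assumed form
        beta[pos - 1] = (-1) ** (k + 1) * (k + 1) / 4
    return beta

def sigma_model1(p, rho=0.8):
    """Model 1: sigma_ij = 0.8^|i-j|."""
    return [[rho ** abs(i - j) for j in range(p)] for i in range(p)]

def sigma_model2(p, rho=0.5):
    """Model 2: unit diagonal, 0.5 off the diagonal."""
    return [[1.0 if i == j else rho for j in range(p)] for i in range(p)]

def mean_from_direction(Sigma, beta):
    """mu_1 = Sigma beta_0 when mu_2 = 0, since beta_0 = Sigma^{-1}(mu_1 - mu_2)."""
    return [sum(s * b for s, b in zip(row, beta)) for row in Sigma]

p = 100
b0 = beta0(p)
support = [j + 1 for j, v in enumerate(b0) if v != 0.0]   # 1-based positions
mu1_model2 = mean_from_direction(sigma_model2(p), b0)
```

Under Model 2 with $p = 100$, this construction gives mean differences of magnitude 0.125 at features 30 and 70 but 0.5 at every noise feature, which illustrates how a marginal statistic can rank true signals below noise.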



In the first simulation, we compare the performance of our proposed TLDA
method and the two-sample t-statistic (Fan and Fan, 2008). The average
misclassification rates based on 100 simulations are reported in Fig. 1, and
here $p = 100$. The figure shows that TLDA always selects more useful
features than the two-sample t-statistic, which ignores the correlation between
features. Specifically, due to correlations, features 30 and 70 cannot be
detected by the two-sample t-statistic for Model 2.

In the second simulation, we study the misclassification rate of our TLDA
method. In Cai and Liu (2011) and Fan et al. (2012), the authors have
conducted many numerical investigations to compare their methods with
others, including the oracle features annealed independence rule (OFAIR)
(Fan and Fan, 2008) and the nearest shrunken centroid (NSC) method (Tibshirani
et al., 2002), and concluded that their methods perform better. We therefore
compare TLDA only with LPD and ROAD and do not consider other classic
methods. Table 1 shows the misclassification rates based on 100 replications
for TLDA, LPD, ROAD, naive Bayes (NB) and Oracle.

From Table 1 we can see that the performance of TLDA is similar to that
of Oracle and is better than that of the other methods. Clearly, due to its
fundamental drawback, the naive Bayes is the worst of all methods, although
it is better than a random guess (whose misclassification rate is 50%). Overall,
compared with LPD and ROAD, TLDA has the smallest misclassification
rate, and the standard deviation of TLDA is similar to that of LPD but
smaller than that of ROAD. When the dimensionality $p$ increases from 100
to 800, TLDA is quite stable, whereas LPD and ROAD become increasingly
worse. In particular, TLDA always has a smaller misclassification rate and
standard deviation than ROAD. When $p$ is not large, TLDA and LPD have
Figure 1: Plots for TLDA and the t-statistic (one column of panels each for
Model 1 and Model 2; horizontal axis: dimensionality). Upper: average
misclassification rates versus number of selected features; Middle: average
estimated $\beta_0$, representing the signal of choosing features by TLDA;
Lower: average estimated $\mu_1 - \mu_2$, representing the signal of choosing
features by the t-statistic.
Table 1: Average misclassification rates in percentage for sparse situations. Standard
deviations are given in parentheses.

p      TLDA          LPD           ROAD          NB            Oracle
Model 1
100    13.41(2.68)   13.58(2.48)   16.68(5.44)   16.94(2.64)   11.59(2.18)
200    13.31(2.45)   13.62(2.55)   16.19(5.05)   17.18(2.54)   11.66(2.38)
400    13.99(2.56)   14.06(2.69)   17.45(5.49)   18.86(2.67)   11.88(2.39)
800    14.16(2.94)   14.93(2.96)   18.22(5.08)   20.56(2.92)   11.74(2.30)
Model 2
100    20.78(3.01)   21.04(3.14)   25.01(4.47)   35.13(3.02)   18.41(2.66)
200    20.91(3.26)   21.58(3.27)   25.49(3.91)   35.92(2.76)   18.55(2.55)
400    21.49(3.50)   22.49(3.55)   26.04(3.88)   35.87(2.86)   18.60(2.76)
800    21.99(3.70)   23.31(3.75)   26.62(3.71)   36.04(3.03)   18.70(3.13)


similar performance, while TLDA becomes better than LPD as $p$ increases;
in particular, when $p$ is sufficiently large (such as $p = 800$), the
difference between the misclassification rates of TLDA and LPD becomes larger.
In summary, simulations demonstrate that TLDA is a stable and superior
classification method compared to existing methods.

Next, we study the estimators $\hat\beta_{TLDA}$, $\hat\beta_{LPD}$, and
$\hat\beta_{ROAD}$. Fig. 2 plots the average estimators over 100 replications.
Due to different assumptions, here we adjust $\hat\beta_{ROAD}$ to
$|\beta_0|_2 \cdot \hat\beta_{ROAD}$ so that it fits the real situation. From
Fig. 2 we can see that TLDA correctly selects most of the five true features
and very few noise features. In particular, compared with LPD, which estimates
the true $\beta_0$ directly, our two-stage estimators are much closer to
$\beta_0$, which is consistent with the discussions in Candes and Tao (2007).

The above simulations are conducted for scenarios where $\beta_0$ is sparse.
In practice, it is quite common that there are many weak signals that are
correlated with the main signals. It would be interesting to evaluate the
performance of TLDA for these approximately sparse situations. Specifically,
we will consider two scenarios with respect to $\mu_1$, as follows.



Model 3. /ij 



;i 5 , P _ 5 ) in Model 1. 



Model 4. $\beta_0 = 0.551 \cdot (3, 1.7, -2.2, -2.1, 2.55, (p-5)^{-1} 1_{p-5})^T$ and
$\mu_1 = \Sigma \beta_0$ in Model 2.

Here $n_1 = n_2 = 100$ and $\mu_2 = 0$. Model 3 is similar to those in Cai and
Liu (2011) and Fan et al. (2012), and Model 4 comes from Mai et al. (2012). The
Figure 2: Average estimators of TLDA, LPD and ROAD for $p = 100$ (panels for
Model 1 and Model 2; legend: True, Estimate). The true $\beta_0$ and
the estimators are very sparse, which is why there is an almost solid line at zero.
average misclassification rates based on 100 replications are reported in Table
2. It is again evident that TLDA performs favorably compared to existing
methods.

Table 2: Average misclassification rates in percentage for approximately sparse
simulations. Standard deviations are given in parentheses.

p      TLDA          LPD           ROAD          NB            Oracle
Model 3
100    20.70(3.12)   22.69(3.67)   26.85(5.91)   31.46(4.07)   18.56(2.54)
200    20.89(3.11)   24.03(3.83)   27.52(5.37)   33.74(3.68)   18.98(2.65)
400    20.96(3.18)   25.03(3.77)   28.03(5.36)   36.61(3.69)   18.65(2.59)
800    21.75(4.56)   26.77(4.60)   28.73(5.14)   40.71(3.63)   18.80(2.69)
Model 4
100    11.99(2.68)   12.30(2.59)   14.57(3.33)   21.87(2.68)    9.98(2.07)
200    12.64(2.58)   13.04(2.67)   15.15(3.19)   22.17(2.97)   10.60(2.06)
400    12.70(2.64)   13.52(2.40)   15.56(3.09)   22.28(2.79)   10.03(2.17)
800    12.90(3.01)   13.85(2.94)   15.35(3.75)   22.33(3.11)   10.08(2.21)



4. Real data 



In this section, we apply the proposed TLDA to real datasets. Since real
data usually have an ultra-high dimension $p$, a sure independence screening
(SIS) method (Fan and Lv, 2008) is carried out before our proposed
feature selection procedure to further improve the accuracy and control the
computational cost. For brevity, we apply the two-sample t-test statistic
(Tibshirani et al., 2002; Fan and Fan, 2008) to reduce the dimensionality
from ultra-high to a moderate scale. Other screening steps such as that in
Fan et al. (2012) can also be used, but we do not pursue them in detail.

First, TLDA is applied to the leukemia data, which is available at
http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi. The dataset
contains $p = 7129$ genes for $n_1 = 27$ acute lymphoblastic leukemia (ALL)
samples and $n_2 = 11$ acute myeloid leukemia (AML) samples in the training
set; the test set consists of 20 ALL samples and 14 AML samples. More
details can be found in Golub et al. (1999). Following pre-processing
steps similar to those of Dudoit et al. (2002a) and Fan and Fan (2008), we
standardize each sample to zero mean, and
$S_n = \frac{1}{n} \sum_{k=1}^{2} \sum_{j=1}^{n_k} (X_{kj} - \bar X_k)(X_{kj} - \bar X_k)^T$
has unit diagonal elements.
Table 3: Classification errors of leukemia data by various methods

                        TLDA   LPD    ROAD   OFAIR   NSC    NB
Training Error          0/38   0/38   0/38   1/38    1/38   0/38
Test Error              1/34   1/34   1/34   1/34    3/34   5/34
No. of Selected genes   8      151    40     11      24     7129



For comparison with LPD in Cai and Liu (2011), we use the 2867 genes with
the largest absolute values of the two-sample t-statistic
($|\hat\mu_1 - \hat\mu_2| > 0.5$). Fig. 3 shows the mean difference and the
estimator $\hat\beta$ (tuning parameter $\lambda = 1.2$), representing the
feature selection signals of the two-sample t-statistic and TLDA, respectively.
Clearly, the signal for TLDA is sparse, while the signal for the two-sample
t-statistic shows no clear pattern. The classification results for
TLDA, LPD, ROAD, OFAIR, NSC, and NB are shown in Table 3.

Figure 3: True mean difference and estimator of $\beta_0$ for the leukemia data.

Table 3 shows that TLDA performs competitively in classification error
with LPD and ROAD. However, TLDA selects only 8 genes, in contrast to
40 genes for ROAD and 151 genes for LPD. The 8 selected genes and their
Table 4: The eight genes of leukemia data selected by TLDA.

Gene position   TLDA weights   Rank of t-statistic
461             -3.203         7
1779            -4.455         87
1834            -5.039         6
3320            -0.960         1
3525            -3.876         138
4847            -6.389         2
5039            -1.187         4
6539            -7.9933        21



TLDA weights are given in Table 4. For comparison, we also present their
t-statistic rank in the 7129 genes.

We further compare the methods on two more real datasets: the colon
(Srivastava and Kubokawa, 2007) and breast cancer (Hess et al., 2006) datasets.
A leave-one-out cross validation (LOOCV) is performed on the two datasets.
For $i = 1, \ldots, n$, the $p \times 1$ vector $x_i$ is treated as the testing
set, while the remaining $n - 1$ observations form the training set. A subset
of 1000 genes is selected based on the two-sample t-statistic. The
classification results for the TLDA, LPD, ROAD, and NB methods are shown in
Table 5. We can see that, on each dataset, the proposed TLDA has a competitive
performance in terms of classification errors while using the fewest genes.
Overall, TLDA is also applicable to real datasets and performs favorably in
comparison to existing methods.
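The LOOCV protocol described above can be sketched generically. In this illustration the classifier is a simple nearest-mean rule standing in for TLDA, and the function names and toy data are assumptions made here; any `fit`/`predict` pair with the same interface could be plugged in.

```python
def loocv_error(X, y, fit, predict):
    """Leave-one-out error: hold out each sample once, train on the rest."""
    errors = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)
        if predict(model, X[i]) != y[i]:
            errors += 1
    return errors / len(X)

def fit_nearest_mean(X, y):
    """Stand-in classifier (not TLDA): store the mean vector of each class."""
    means = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        means[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return means

def predict_nearest_mean(means, x):
    """Assign x to the class whose mean is closest in squared distance."""
    return min(means,
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, means[lab])))

# Toy data (hypothetical): two well-separated classes.
X = [[0.0, 0.1], [0.2, -0.1], [0.1, 0.0], [2.0, 0.1], [2.1, -0.2], [1.9, 0.0]]
y = [1, 1, 1, 2, 2, 2]
err = loocv_error(X, y, fit_nearest_mean, predict_nearest_mean)
```

Any data-dependent preprocessing, such as gene screening, would be placed inside `fit` if it is meant to be repeated on each training set.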



Table 5: Classification error and number of genes selected by various methods for the
colon and breast cancer datasets

                     TLDA          LPD              ROAD           NB
Colon Error (%)      9.68          9.68             11.29          14.52
No. of genes         7.42(1.03)    168.95(71.39)    38.10(27.60)   1000(0)
Breast Error (%)     21.80         25.56            31.58          34.59
No. of genes         14.61(2.40)   332.45(103.56)   44.14(47.26)   1000(0)



5. Discussions 

In this paper, we have proposed a solution for feature selection in
high-dimensional data. We have derived the optimal feature selection rule for
LDA and proposed the selection of features based on the sparsity of $\beta_0$.
An $\ell_1$ minimization method is used on the samples to select the important
features, and LDA is then applied to those selected features. Our proposed TLDA
performs favorably compared to existing methods in theory and application.
Our analysis shows that independent rules such as the two-sample
t-statistic and naive Bayes may not be efficient and may even lead to bad
classifiers.

Although this article assumes $K = 2$ classes, TLDA is also applicable when
there are $K > 2$ classes. In that case, $X$ is classified to class
$k$ if and only if

    $$(X - (\bar X_k + \bar X_l)/2)_{A^*}^T \hat\beta^* > 0 \quad \text{for all} \ l \neq k. \tag{5.10}$$

Moreover, the procedure can be extended to unequal prior probabilities $\pi_1$
and $\pi_2$, in which case we classify $X$ to class 1 when

    $$(X - (\bar X_1 + \bar X_2)/2)_{A^*}^T \hat\beta^* > \log(\pi_2/\pi_1), \tag{5.11}$$

where the parameters can be estimated as $\hat\pi_1 = n_1/n$ and
$\hat\pi_2 = n_2/n$. For non-Gaussian distributions, we can also derive similar
results under the moment conditions, as in Cai and Liu (2011).
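The effect of the prior-adjusted threshold in (5.11) can be sketched as follows. The discriminant score is taken as given, and the function name and example numbers are assumptions of this illustration.

```python
import math

def classify_with_priors(score, n1, n2):
    """Rule (5.11): class 1 if score > log(pi2/pi1), with pi_k estimated by n_k/n."""
    n = n1 + n2
    pi1, pi2 = n1 / n, n2 / n
    threshold = math.log(pi2 / pi1)
    return 1 if score > threshold else 2

# With 27 versus 11 training samples the threshold is log(11/27) < 0,
# so a mildly negative score is still assigned to the larger class 1.
label_unequal = classify_with_priors(-0.5, 27, 11)
label_equal = classify_with_priors(-0.5, 10, 10)   # threshold 0 recovers the equal-prior rule
```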



Finally, we note that the number of selected features is
$p_0 = o(\sqrt{n/\log p})$, which is very small compared to $p$. Setting
$n = O((\log p)^\beta)$ for $\beta > 1$, this means that only
$o((\log p)^{(\beta-1)/2})$ features can be selected from $p$ variables to
apply LDA. This is due to the fact that LDA is stable only when
$p_0 \sqrt{\log p_0 / n} \to 0$, and a detailed result can be found in Shao et
al. (2011). Our future research will focus on improving $p_0$.
Appendix A: Proofs

A.1. Proof of Theorem 2.1

From the proofs of Theorem 2 in Cai and Liu (2011), we know that

    $$(\hat\beta - \beta_0)^T \Sigma (\hat\beta - \beta_0) \le C |\beta_0|_1^2 \sqrt{\log p / n} + 6 \lambda_n |\beta_0|_1, \tag{5.12}$$

with probability greater than $1 - O(p^{-1})$. Using the Cauchy-Schwarz
inequality,

    $$|\beta_0|_1^2 \le |\beta_0|_0 |\beta_0|_2^2 \le c_0 p_0 (\beta_0^T \Sigma \beta_0) = 4 c_0 p_0 \Delta_p,$$
    $$(\hat\beta - \beta_0)^T \Sigma (\hat\beta - \beta_0) \ge c_0^{-1} (\hat\beta - \beta_0)^T (\hat\beta - \beta_0).$$

Together with (5.12), we have

    $$(\hat\beta - \beta_0)^T (\hat\beta - \beta_0) \le C p_0 \Delta_p \sqrt{\log p / n}, \tag{5.13}$$

with probability greater than $1 - O(p^{-1})$. For $j \in A$,

    $$|\hat\beta_j - (\beta_0)_j|^2 \le C p_0 \Delta_p \sqrt{\log p / n}.$$

Then

    $$|\hat\beta_j| \ge |(\beta_0)_j| - \sqrt{C p_0 \Delta_p \sqrt{\log p / n}}
      \ge c_p \left(1 - \sqrt{C p_0 \Delta_p \sqrt{\log p / n}} / c_p\right)
      \ge c_p / 2.$$

Similarly, for $j \in A^c$,

    $$|\hat\beta_j| \le \sqrt{C p_0 \Delta_p \sqrt{\log p / n}} < c_p / 2.$$

Hence, we have proved that $P(A^* = A) = 1 - O(p^{-1})$.

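A.2. Proof of Theorem 2.2

Applying the feature selector $A^*$ to the samples
$\{X_{1j}, j = 1, \ldots, n_1\}$ and $\{X_{2j}, j = 1, \ldots, n_2\}$, we still
denote the corresponding data as $X$, $\{X_{kj}, k = 1, 2\}$ for brevity. Note
that here the dimension is $p_0$, not $p$. Setting

    $$\bar X_k = \frac{1}{n_k} \sum_{j=1}^{n_k} X_{kj}, \quad k = 1, 2, \qquad S_n = \frac{1}{n} \sum_{k=1}^{2} \sum_{j=1}^{n_k} (X_{kj} - \bar X_k)(X_{kj} - \bar X_k)^T,$$

and

    $$\hat\mu_a = \frac{\bar X_1 + \bar X_2}{2}, \qquad \hat\mu_d = \frac{\bar X_1 - \bar X_2}{2},$$

the LDA procedure is

    $$\delta_{LDA}(X) = I\{(X - \hat\mu_a)^T S_n^{-1} \hat\mu_d > 0\},$$

and the misclassification rate is

    $$R_{A^*} = \frac{1}{2} \Phi\left( \frac{(\hat\mu_a - \mu_1)^T S_n^{-1} \hat\mu_d}{(\hat\mu_d^T S_n^{-1} \Sigma S_n^{-1} \hat\mu_d)^{1/2}} \right) + \frac{1}{2} \Phi\left( -\frac{(\hat\mu_a - \mu_2)^T S_n^{-1} \hat\mu_d}{(\hat\mu_d^T S_n^{-1} \Sigma S_n^{-1} \hat\mu_d)^{1/2}} \right).$$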

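By the proofs of Theorem 1 in Shao et al. (2011), we know that

    $$\Phi\left( \frac{(\hat\mu_a - \mu_1)^T S_n^{-1} \hat\mu_d}{(\hat\mu_d^T S_n^{-1} \Sigma S_n^{-1} \hat\mu_d)^{1/2}} \right) = \Phi\left(-\Delta_p^{1/2}(1 + O(p_0 \sqrt{\log p_0 / n}))\right), \tag{5.14}$$

and a similar result also holds for
$\Phi\left( -\frac{(\hat\mu_a - \mu_2)^T S_n^{-1} \hat\mu_d}{(\hat\mu_d^T S_n^{-1} \Sigma S_n^{-1} \hat\mu_d)^{1/2}} \right)$. Then

    $$R_{A^*} = \Phi\left(-\Delta_p^{1/2}(1 + O(p_0 \sqrt{\log p_0 / n}))\right). \tag{5.15}$$

Noting that $p_0 \sqrt{\log p_0 / n} \to 0$, therefore, in probability,

    $$R_{A^*} - R \to 0.$$

From equation (12) of Cai and Liu (2011), we know that

    $$\left| \frac{R_{A^*}}{R} - 1 \right| = \left| \frac{\Phi\left(-\Delta_p^{1/2}(1 + O(p_0 \sqrt{\log p_0 / n}))\right)}{\Phi(-\Delta_p^{1/2})} - 1 \right| \le O(\Delta_p p_0 \sqrt{\log p_0 / n}) \, e^{O(\Delta_p p_0 \sqrt{\log p_0 / n})}. \tag{5.16}$$

When $\Delta_p p_0 \sqrt{\log p_0 / n} \to 0$, we get

    $$\frac{R_{A^*}}{R} - 1 = O(\Delta_p p_0 \sqrt{\log p_0 / n}).$$

The proof is completed.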